Photo by Nuno Alberto on Unsplash

1 Introduction

According to Worldometer, China is the most populated country in the world. Chinese characters (hànzì) have been the basis of writing systems in other languages, such as Japanese (kanji), Korean (hanja), and Vietnamese (chữ Hán). Nevertheless, few people other than Chinese speakers understand Chinese characters, mainly because of their complexity. Therefore, in this project let's build a machine learning model that can recognize Chinese characters, specifically Chinese digits. The data consists of images of handwritten Chinese characters for 0 to 10, 100, 1000, 10000 (ten thousand), and 100000000 (a hundred million).

1.1 Image Data

A digital image consists of pixels, or picture elements. According to the Cambridge Dictionary, a pixel is the smallest unit of an image on a digital platform. Each pixel contains a value, called the pixel value, which ranges from 0 to 255 and describes the pixel's brightness (for grayscale images) or color strength (for colored images).

The size of an image depends on the number of pixels it has. For example, an image of size \(1000 \times 1500\) has a width of 1000 pixels and a height of 1500 pixels, for a total of \(1000 \times 1500 = 1500000\) (one and a half million) pixels.
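The arithmetic above can be checked in a couple of lines of R (using the same hypothetical \(1000 \times 1500\) image):

```r
# A grayscale image is just a matrix of pixel values (0 = black, 255 = white).
img_width  <- 1000   # pixels
img_height <- 1500   # pixels
total_pixels <- img_width * img_height
total_pixels
## [1] 1500000
```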

2 Prerequisites

2.1 Import Library

Let’s import the required libraries. In this project, we will use the keras library to build our machine learning model, dplyr for data wrangling, and caret to split the data and build a confusion matrix.

library(keras)
library(dplyr)
library(caret)

2.2 Load the Datasets

The data we are going to use is taken from Kaggle: the Chinese Digit Recognizer as the main dataset, and Chinese MNIST as the indexing dataset containing the labels.

chinese_mnist <- read.csv("data-input/chineseMNIST.csv") # main dataset
chinese_meta <- read.csv("data-input/chinese_mnist.csv") # index (contains labels)

# combine the metadata to the main dataset
chinese_mnist$label <- chinese_meta$code
chinese_mnist$value <- chinese_meta$value

# rearrange the dataset so "label", "value", and "character" are the first three columns
chinese_mnist <- chinese_mnist %>%
  select(label, value, character, everything())

3 Exploratory Data Analysis

3.1 Data Inspection

Let’s inspect the data by using the head() function.

head(chinese_mnist)

Explanation:

  • label = numerical code for the target variable

  • value = the actual value of each character

  • character = the Chinese number character in Unicode

  • pixel_0, …, pixel_4096 = predictors, one pixel value per column

3.2 Check Missing Value

Let’s check if we have missing values in our dataset.

sum(is.na(chinese_mnist))
## [1] 0

We have no missing values, sweet!

3.3 Check Class Imbalance

Now let’s check how many entries we have for each character.

table(chinese_mnist$character)
## 
##   一   七   万   三   九   二   五   亿   八   六   十   千   四   百   零 
## 1000 1000 1000 1000 1000 1000 1000 1000 1000 1000 1000 1000 1000 1000 1000

The result above shows that there is no class imbalance, great!

3.4 Check Data Size

Next, let’s check the dimensions of our predictor data.

dim(chinese_mnist[-c(1:3)])
## [1] 15000  4096

The number 15000 shows that we have 15000 entries in our dataset, while 4096 is the number of predictor columns, i.e., the number of pixels per image. Each image is a square of \(64\times64\) pixels, and \(64\times64=4096\) (you can verify it yourself!)

3.5 Data Visualization

Let’s take a peek at 36 random entries of our dataset using vizTrain, a function made by Samuel Chan!

vizTrain <- function(input){
  
  dimn <- ceiling(sqrt(nrow(input)))
  par(mfrow=c(dimn, dimn), mar=c(.1, .1, .1, .1))
  
  for (i in 1:nrow(input)){
      # columns 4:4099 hold the 4096 pixel values of one image
      m1 <- as.matrix(input[i,4:4099])
      dim(m1) <- c(64,64)
      
      # rotate the matrix so the image is displayed upright
      m1 <- apply(apply(m1, 1, rev), 1, t)
      
      image(1:64, 1:64, 
            m1, col=grey.colors(255), 
            # remove axis text
            xaxt = 'n', yaxt = 'n')
      # print the character's value (column 2) on the image
      text(25, 10, col="white", cex=1.2, input[i, 2])
  }
  
}

vizTrain(sample_n(chinese_mnist, 36))

4 Data Preprocessing

Let’s take a look at the unique values from the label column.

sort(unique(chinese_mnist$label))
##  [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15

We can see that the labels start from 1. This can cause a dimensional mismatch error, since Keras (a library built on Python, where numbering starts from 0) expects zero-based class labels. To avoid that error, let’s shift the labels so they start from 0.

chinese_mnist <- chinese_mnist %>%
  mutate(label = ifelse(label > 0, label-1, label))

sort(unique(chinese_mnist$label))
##  [1]  0  1  2  3  4  5  6  7  8  9 10 11 12 13 14

4.1 Split Dataset

To validate the performance of our model, let’s split the dataset into training, validation, and testing sets. This way, we can see how our machine learning model performs when dealing with unseen data. We will split the data into 80% training, 10% validation, and 10% testing datasets.

set.seed(100)
train_index <- createDataPartition(as.factor(chinese_mnist$label), p = 0.8, list = FALSE)
data_train <- chinese_mnist[ train_index,]
data_test  <- chinese_mnist[-train_index,]

set.seed(100)
test_index <- createDataPartition(as.factor(data_test$label), p = 0.5, list = FALSE)
data_val   <- data_test[ test_index,]
data_test  <- data_test[-test_index,]

4.2 Scaling and Data Separation

Remember the theory about pixels mentioned at the beginning? Working with values ranging from 0 to 255 is computationally demanding, so it is wise to scale the data to the range 0 to 1. This is called min-max scaling: \(x' = (x - x_{min}) / (x_{max} - x_{min})\). Since our pixel values have a minimum of 0 and a maximum of 255, this reduces to dividing every value by 255.
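A quick sanity check on toy pixel values (hypothetical numbers, not from our dataset) confirms that the general min-max formula and the simple division by 255 agree:

```r
pixels <- c(0, 51, 128, 255)  # toy pixel values spanning the full range

# general min-max scaling: subtract the minimum, divide by the range
scaled <- (pixels - min(pixels)) / (max(pixels) - min(pixels))

# with min 0 and max 255, this is the same as dividing by 255
all.equal(scaled, pixels / 255)
## [1] TRUE
```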

To ease further data processing, let’s separate the predictors and labels and put them into new variables.

data_train_x <- data_train %>% 
  select(-c(label,character,value)) %>% # take only the predictors
  as.matrix()/255 # change the data type into matrix and do min-max scaling

data_train_y <- data_train$label # take only the labels

data_val_x <- data_val %>% 
  select(-c(label,character,value)) %>% 
  as.matrix()/255

data_val_y <- data_val$label

data_test_x <- data_test %>% 
  select(-c(label,character,value)) %>% 
  as.matrix()/255

data_test_y <- data_test$label

4.3 Change the Data Type

Keras only accepts predictors in the form of arrays, and labels in the form of one-hot-encoded categories. One-hot encoding means that we give each class its own binary indicator column.
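To see what to_categorical() will produce, here is a base-R sketch of one-hot encoding on a few toy labels (not our actual data):

```r
labels <- c(0, 2, 1, 2)   # toy class labels, 0-based as Keras expects
num_class <- 3

# each label becomes a binary row: a 1 in the column of its class, 0 elsewhere
one_hot <- diag(num_class)[labels + 1, ]
one_hot
##      [,1] [,2] [,3]
## [1,]    1    0    0
## [2,]    0    0    1
## [3,]    0    1    0
## [4,]    0    0    1
```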

Let’s change our predictors into arrays and do one-hot encoding for our labels.

# Change predictors to arrays
train_x <- array_reshape(data_train_x, dim=dim(data_train_x))
val_x <- array_reshape(data_val_x, dim=dim(data_val_x))
test_x <- array_reshape(data_test_x, dim=dim(data_test_x))

# One-hot encoding target variable
train_y <- to_categorical(data_train_y)
val_y <- to_categorical(data_val_y)
test_y <- to_categorical(data_test_y)

5 The ML Model

In this project, we are going to use a Deep Neural Network (DNN). According to IBM, DNN is a neural network with more than three layers, including the input and output layers.

To build a DNN with Keras, there are two major steps.

  1. Make a sequential model
    To make a sequential model, we can use the function keras_model_sequential()

  2. Make the neural network layers
    The most basic layer type is the dense layer. We can make a dense layer using layer_dense() function. This function has several parameters:

    • input_shape : the shape of our predictors, only used for the first hidden layer

    • units : the number of neurons in a layer

    • activation : the activation function used for a layer

    • name : (optional) the name of a layer

    Note: for the last layer (output layer), units has to equal the number of target classes.

5.1 Take the Data Sizes

To ease assigning parameter values when building the deep learning model, let’s store our data sizes in new variables.

input_dim <- ncol(train_x) # dimension of predictors
num_class <- n_distinct(data_train$label)

5.2 Build the DNN Architecture

Let’s try to build a Deep Neural Network (DNN) with only two hidden layers, with 64 and 32 neurons/nodes in the first and second hidden layers respectively. We’ll use the ReLU activation function for the hidden layers, since it passes only non-negative values forward, and Softmax for the output layer, since we are dealing with a multiclass classification case.

model1 <- keras_model_sequential() %>% 
  
  # input layer + first hidden layer
  layer_dense(input_shape = input_dim, # dimension of predictors
              units = 64, # number of neurons/nodes
              activation = "relu", # activation function
              name = "hidden_1") %>%
  
  # Dense layer
  layer_dense(units = 32,
              activation = "relu") %>% # to produce non-negative values
  
  # output layer
  layer_dense(units = num_class, # num. of target classes
              activation = "softmax", # for multiclass classification case
              name = "output")

5.3 Compile the Model

Since we are working with a multiclass classification case, we will use categorical cross-entropy as our loss function. Let’s try the Adam optimizer, since it’s one of the most widely used optimizers. According to Jason Brownlee, PhD, the founder of Machine Learning Mastery,

The Adam optimization algorithm is an extension to stochastic gradient descent that has recently seen broader adoption for deep learning applications in computer vision and natural language processing.

If you are interested, you can read more about Adam optimizer straight from its inventors.

model1 %>% 
  compile(loss = loss_categorical_crossentropy(),
          optimizer = optimizer_adam(learning_rate = 0.01),
          metrics = "accuracy")

5.4 Fit the Model

Time for the model to learn! Let’s use the fit() function and pass epochs=15 so the model iterates over the full training data 15 times, and batch_size=1000 so the weights are updated after every batch of 1000 samples. To track accuracy on unseen data, don’t forget to pass our validation datasets to the validation_data parameter.

history <- model1 %>% 
  fit(x = train_x,
      y = train_y,
      epochs = 15,
      validation_data = list(val_x, val_y),
      batch_size = 1000)

plot(history)

5.5 Predict the Test Data

Now let’s predict the labels of our testing dataset using the predict() function.

pred <- predict(model1, test_x) %>% 
  k_argmax() %>% # take the class with the highest probability
  as.array() %>%
  as.factor()
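Under the hood, k_argmax() simply takes, for each row of predicted class probabilities, the (0-based) index of the largest value. A base-R sketch with a hypothetical probability matrix:

```r
# two "images", three classes: each row holds the predicted class probabilities
probs <- matrix(c(0.1, 0.7, 0.2,
                  0.5, 0.3, 0.2), nrow = 2, byrow = TRUE)

# max.col() returns the 1-based column of the row maximum;
# subtract 1 to match Keras' 0-based class labels
pred_class <- max.col(probs) - 1
pred_class
## [1] 1 0
```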

5.6 Evaluate the Model

Last step: evaluation! We can use the confusionMatrix() function from the caret library to produce a confusion matrix and an accuracy value.

confusionMatrix(data = pred, reference = as.factor(data_test$label))
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1   2   3   4   5   6   7   8   9  10  11  12  13  14
##         0   86   0   0   0   2   0   0   0   0   0   1   9   2   4   0
##         1    0 100   6   0   0   0   1   0   1   0   0   0   0   0   0
##         2    0   0  80  10   1   1   1   0   0   0   0   0   0   0   0
##         3    0   0  11  84   0   2   5   0   0   1   0   0   1   0   0
##         4    0   0   0   0  83   0   2   0   0   2   1   0   1   2   1
##         5    0   0   1   3   0  85   4   5   0   5   0   5   2   0   2
##         6    0   0   0   3   0   1  66   5   4   9   0   2   1   0   4
##         7    0   0   0   0   3   3   1  81   1   3   1   3   1   3   3
##         8    0   0   0   0   1   0   2   0  85   1   1   0   0   1   2
##         9    0   0   0   0   1   3   7   4   1  66   0   2   3   9  10
##         10   0   0   1   0   1   1   0   1   0   0  90   0  20   1   0
##         11  12   0   0   0   5   2   3   1   1   3   0  67   0  17   1
##         12   1   0   1   0   0   1   5   1   1   0   5   2  69   2   0
##         13   1   0   0   0   1   1   2   0   2   3   1   9   0  61   0
##         14   0   0   0   0   2   0   1   2   4   7   0   1   0   0  77
## 
## Overall Statistics
##                                                
##                Accuracy : 0.7867               
##                  95% CI : (0.7651, 0.8072)     
##     No Information Rate : 0.0667               
##     P-Value [Acc > NIR] : < 0.00000000000000022
##                                                
##                   Kappa : 0.7714               
##                                                
##  Mcnemar's Test P-Value : NA                   
## 
## Statistics by Class:
## 
##                      Class: 0 Class: 1 Class: 2 Class: 3 Class: 4 Class: 5
## Sensitivity           0.86000  1.00000  0.80000  0.84000  0.83000  0.85000
## Specificity           0.98714  0.99429  0.99071  0.98571  0.99357  0.98071
## Pos Pred Value        0.82692  0.92593  0.86022  0.80769  0.90217  0.75893
## Neg Pred Value        0.98997  1.00000  0.98579  0.98854  0.98793  0.98919
## Prevalence            0.06667  0.06667  0.06667  0.06667  0.06667  0.06667
## Detection Rate        0.05733  0.06667  0.05333  0.05600  0.05533  0.05667
## Detection Prevalence  0.06933  0.07200  0.06200  0.06933  0.06133  0.07467
## Balanced Accuracy     0.92357  0.99714  0.89536  0.91286  0.91179  0.91536
##                      Class: 6 Class: 7 Class: 8 Class: 9 Class: 10 Class: 11
## Sensitivity           0.66000  0.81000  0.85000  0.66000   0.90000   0.67000
## Specificity           0.97929  0.98429  0.99429  0.97143   0.98214   0.96786
## Pos Pred Value        0.69474  0.78641  0.91398  0.62264   0.78261   0.59821
## Neg Pred Value        0.97580  0.98640  0.98934  0.97561   0.99278   0.97622
## Prevalence            0.06667  0.06667  0.06667  0.06667   0.06667   0.06667
## Detection Rate        0.04400  0.05400  0.05667  0.04400   0.06000   0.04467
## Detection Prevalence  0.06333  0.06867  0.06200  0.07067   0.07667   0.07467
## Balanced Accuracy     0.81964  0.89714  0.92214  0.81571   0.94107   0.81893
##                      Class: 12 Class: 13 Class: 14
## Sensitivity            0.69000   0.61000   0.77000
## Specificity            0.98643   0.98571   0.98786
## Pos Pred Value         0.78409   0.75309   0.81915
## Neg Pred Value         0.97805   0.97252   0.98364
## Prevalence             0.06667   0.06667   0.06667
## Detection Rate         0.04600   0.04067   0.05133
## Detection Prevalence   0.05867   0.05400   0.06267
## Balanced Accuracy      0.83821   0.79786   0.87893

It seems the accuracy is still quite low. Let’s improve our model by adding convolutional layers to our neural network.

6 Convolutional Neural Network

A Convolutional Neural Network (CNN) is a type of neural network that uses convolution layers. CNNs are especially popular for image data, as stated by IBM,

Convolutional neural networks are distinguished from other neural networks by their superior performance with image, speech, or audio signal inputs.

A CNN typically consists of these layers.

  1. Convolutional layer
    A convolution layer works by sliding a filter over the image: at each position, the pixel values under the filter are multiplied element-wise by the filter’s values and summed. The filter then shifts until the whole image has been covered.

    Here is an animated version of how convolutional layer works.

    A convolution layer essentially extracts features from your image, such as horizontal lines, vertical lines, and edges. Our machine learns these features during the training process and tries to identify these patterns when given unseen data.

  2. Pooling layer
    A pooling layer is used to decrease computational effort without losing important information. It works similarly to a convolutional layer, except that the pooling filter does not perform multiplication. Instead, it applies an aggregation function to the captured matrix. There are two types of pooling:

    1. Max pooling
      Works by taking the maximum value of the captured matrix

    2. Average pooling
      Works by taking the average value of the captured matrix

  3. Flatten layer
    As the name suggests, a flatten layer flattens the data from a matrix into one long vector so that a dense layer can process it. The layers processing the data after the flatten layer are usually called fully-connected layers.

Here is a diagram that shows how the layers mentioned above are connected.
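To make the three layer types concrete, here is a tiny base-R sketch of what convolution, max pooling, and flattening each compute, using a toy \(4\times4\) image and a fixed \(2\times2\) filter (nothing from our dataset; pooling is applied to the raw toy image for simplicity):

```r
img <- matrix(1:16, nrow = 4, byrow = TRUE)  # toy 4x4 "image"

# --- Convolution: slide a 2x2 filter, multiply element-wise, and sum ---
filt <- matrix(c(1,  0,
                 0, -1), nrow = 2, byrow = TRUE)
conv_out <- matrix(0, 3, 3)  # output shrinks to 3x3 (stride 1, no padding)
for (i in 1:3) for (j in 1:3) {
  conv_out[i, j] <- sum(img[i:(i + 1), j:(j + 1)] * filt)
}

# --- Max pooling: 2x2 windows with stride 2, keep only the largest value ---
pool_out <- matrix(0, 2, 2)
for (i in 1:2) for (j in 1:2) {
  pool_out[i, j] <- max(img[(2*i - 1):(2*i), (2*j - 1):(2*j)])
}

# --- Flatten: unroll the matrix into one long vector for the dense layers ---
flat <- as.vector(pool_out)
flat
## [1]  6 14  8 16
```

In a real CNN the filter values are learned during training; here they are fixed just to show the mechanics.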

6.1 Data Preprocessing: Change the Dimension

To use a convolution layer, we must go back to the step where we changed the data type. The cell below looks almost identical to the one in the Change the Data Type section.

# Change predictors to arrays
train_x <- array_reshape(data_train_x, dim=c(dim(data_train_x)[1],64,64,1))
val_x <- array_reshape(data_val_x, dim=c(dim(data_val_x)[1],64,64,1))
test_x <- array_reshape(data_test_x, dim=c(dim(data_test_x)[1],64,64,1))

# One-hot encoding target variable
train_y <- to_categorical(data_train_y)
val_y <- to_categorical(data_val_y)
test_y <- to_categorical(data_test_y)

Notice the difference? Correct! This time we need to change the dim parameter inside the array_reshape() function. You can see the difference more clearly in the raw text chunk below.

# Without convolution layer
train_x <- array_reshape(data_train_x, dim=dim(data_train_x))
val_x <- array_reshape(data_val_x, dim=dim(data_val_x))
test_x <- array_reshape(data_test_x, dim=dim(data_test_x))

# With convolution layer
train_x <- array_reshape(data_train_x, dim=c(dim(data_train_x)[1],64,64,1))
val_x <- array_reshape(data_val_x, dim=c(dim(data_val_x)[1],64,64,1))
test_x <- array_reshape(data_test_x, dim=c(dim(data_test_x)[1],64,64,1))

Now, what does c(dim(data_train_x)[1],64,64,1) mean? Explanation:

  • dim(data_train_x)[1] = the number of rows (images) in data_train_x

  • 64,64 = the dimensions of each image

  • 1 = the number of color channel(s): 1 for grayscale and 3 for RGB
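For intuition, the reshape can be mimicked in base R with array() on toy data. (Note that keras::array_reshape() fills the array in row-major, C-style order, whereas base array() is column-major, so the two are not interchangeable on real pixel data; only the resulting dimensions are shown here.)

```r
flat <- matrix(0, nrow = 2, ncol = 4096)    # 2 toy "images", 4096 pixels each
imgs <- array(flat, dim = c(2, 64, 64, 1))  # (samples, height, width, channels)
dim(imgs)
## [1]  2 64 64  1
```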

6.2 Build the CNN Architecture

Let’s redo assigning the data sizes to variables. Note that after reshaping, the input shape will be given directly as c(64,64,1), so this time we only need the number of classes.

num_class <- n_distinct(data_train$label)

To make a convolution layer, we can use the layer_conv_2d() function. Don’t forget to fill the input_shape parameter for the first layer of our network. This time we pass the value c(64,64,1), which means each input image is a two-dimensional array of size \(64\times64\) with one color channel.

model2 <- keras_model_sequential() %>% 
  
  # Convolutional layer
  layer_conv_2d(input_shape = c(64,64,1),
                filters = 16,
                kernel_size = c(3,3), # 3 x 3 filters
                activation = "relu") %>%
  
  # Max pooling layer
  layer_max_pooling_2d(pool_size = c(2,2)) %>%
  
  # Flattening layer
  layer_flatten() %>%
  
  # Dense layer
  layer_dense(units = 32,
              activation = "relu") %>% 
  
  # output layer
  layer_dense(units = num_class,
              activation = "softmax",
              name = "output")

6.3 Compile and Fit the Model

Let’s keep the rest of the steps the same 😃

model2 %>% 
  compile(loss = loss_categorical_crossentropy(),
          optimizer = optimizer_adam(learning_rate = 0.01),
          metrics = "accuracy")
history <- model2 %>% 
  fit(x = train_x,
      y = train_y,
      epochs = 15,
      validation_data = list(val_x, val_y),
      batch_size = 1000)

plot(history)

6.4 Prediction & Evaluation

Let’s use our CNN model to make predictions on the unseen test data.

pred <- predict(model2, test_x) %>% 
  k_argmax() %>%
  as.array() %>% 
  as.factor()
confusionMatrix(data=pred, reference = as.factor(data_test$label))
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  0  1  2  3  4  5  6  7  8  9 10 11 12 13 14
##         0  97  0  0  0  0  1  0  0  0  1  0  0  0  0  0
##         1   0 99  6  2  0  0  1  0  0  0  0  0  0  0  0
##         2   0  1 83 10  0  0  2  0  0  0  0  0  1  0  0
##         3   0  0 11 88  0  3  0  0  0  0  0  0  0  0  0
##         4   0  0  0  0 92  1  1  1  0  1  0  0  0  0  1
##         5   0  0  0  0  0 93  0  2  0  1  0  4  0  0  0
##         6   0  0  0  0  0  0 84  3  0  2  1  0  1  3  0
##         7   0  0  0  0  2  1  0 85  0  2  0  1  1  0  0
##         8   0  0  0  0  0  0  1  0 96  2  0  0  0  0  2
##         9   1  0  0  0  2  0  1  5  1 85  0  0  2  0  6
##         10  0  0  0  0  0  0  1  0  0  0 95  1  6  1  0
##         11  1  0  0  0  3  1  1  1  1  1  0 88  1  5  0
##         12  0  0  0  0  0  0  1  1  0  0  2  0 85  0  0
##         13  1  0  0  0  0  0  7  0  1  2  2  6  3 91  0
##         14  0  0  0  0  1  0  0  2  1  3  0  0  0  0 91
## 
## Overall Statistics
##                                                
##                Accuracy : 0.9013               
##                  95% CI : (0.8851, 0.916)      
##     No Information Rate : 0.0667               
##     P-Value [Acc > NIR] : < 0.00000000000000022
##                                                
##                   Kappa : 0.8943               
##                                                
##  Mcnemar's Test P-Value : NA                   
## 
## Statistics by Class:
## 
##                      Class: 0 Class: 1 Class: 2 Class: 3 Class: 4 Class: 5
## Sensitivity           0.97000  0.99000  0.83000  0.88000  0.92000  0.93000
## Specificity           0.99857  0.99357  0.99000  0.99000  0.99643  0.99500
## Pos Pred Value        0.97980  0.91667  0.85567  0.86275  0.94845  0.93000
## Neg Pred Value        0.99786  0.99928  0.98788  0.99142  0.99430  0.99500
## Prevalence            0.06667  0.06667  0.06667  0.06667  0.06667  0.06667
## Detection Rate        0.06467  0.06600  0.05533  0.05867  0.06133  0.06200
## Detection Prevalence  0.06600  0.07200  0.06467  0.06800  0.06467  0.06667
## Balanced Accuracy     0.98429  0.99179  0.91000  0.93500  0.95821  0.96250
##                      Class: 6 Class: 7 Class: 8 Class: 9 Class: 10 Class: 11
## Sensitivity           0.84000  0.85000  0.96000  0.85000   0.95000   0.88000
## Specificity           0.99286  0.99500  0.99643  0.98714   0.99357   0.98929
## Pos Pred Value        0.89362  0.92391  0.95050  0.82524   0.91346   0.85437
## Neg Pred Value        0.98862  0.98935  0.99714  0.98926   0.99642   0.99141
## Prevalence            0.06667  0.06667  0.06667  0.06667   0.06667   0.06667
## Detection Rate        0.05600  0.05667  0.06400  0.05667   0.06333   0.05867
## Detection Prevalence  0.06267  0.06133  0.06733  0.06867   0.06933   0.06867
## Balanced Accuracy     0.91643  0.92250  0.97821  0.91857   0.97179   0.93464
##                      Class: 12 Class: 13 Class: 14
## Sensitivity            0.85000   0.91000   0.91000
## Specificity            0.99714   0.98429   0.99500
## Pos Pred Value         0.95506   0.80531   0.92857
## Neg Pred Value         0.98937   0.99351   0.99358
## Prevalence             0.06667   0.06667   0.06667
## Detection Rate         0.05667   0.06067   0.06067
## Detection Prevalence   0.05933   0.07533   0.06533
## Balanced Accuracy      0.92357   0.94714   0.95250

Fantastic, our model’s accuracy just went up by about 12 percentage points!

7 Conclusion

Due to their complexity, Chinese characters are not easy to learn. Yet, in this project, we have successfully made a machine learn from images of handwritten Chinese digits with a Deep Neural Network and a Convolutional Neural Network. With an ordinary DNN (using only dense layers), we achieved a test accuracy of around 78%, while with a CNN we achieved 90% test accuracy. This shows that CNNs improve model training on image data, as stated by the references mentioned above. A further application of CNNs to image data is image recognition: digit and letter recognition, for example, can be developed into language translation from an image.